Core Data Structures & Communication Primitives for Tensor Parallel for Keras #21697
Conversation
Summary of Changes
Hello @buildwithsuhana, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request lays the fundamental groundwork for introducing backend-agnostic auto-sharding and Tensor Parallelism into Keras 3.0. It establishes a modular and extensible architecture by defining core data structures, abstracting distributed backend functionality, and providing high-level communication primitives. This initial set of changes is crucial for enabling future capabilities that will allow users to train very large models across multiple devices with significantly simplified code.
Code Review
This pull request lays a solid foundation for tensor parallelism in Keras by introducing backend-agnostic abstractions for distributed operations and core data structures for sharding. The overall design is well-structured, separating concerns between backend-specific implementations, communication primitives, and configuration. However, there are several areas that need attention, particularly regarding the correctness of some backend implementations (especially JAX), placeholder logic, API clarity, and code consistency. Addressing these points will strengthen the foundation and prevent issues in future development.
Codecov Report
❌ Patch coverage is …
@@ Coverage Diff @@
## master #21697 +/- ##
==========================================
+ Coverage 82.59% 82.62% +0.02%
==========================================
Files 572 573 +1
Lines 58327 58608 +281
Branches 9131 9161 +30
==========================================
+ Hits 48177 48425 +248
- Misses 7818 7854 +36
+ Partials 2332 2329 -3
I've added a few initial comments and questions during my first look.
To make the review more manageable, I propose we split this change up. At almost 1,800 lines, the current change is quite difficult to review properly. What do you think about limiting this PR to just the JAX backend, and introducing the others in subsequent, smaller PRs?
Thank you for the PR!
Some high-level comments:
- Out of context, it's really hard for me to understand why these abstractions are needed for Tensor Parallel.
- Why do we need all these primitives?
- Why do we need 3 layers of abstraction for the same concepts: the `communications` layer, the `state_actions` layer, and the `keras.distributed.get_communication_ops` layer? Can we just have one?
- These abstractions look Torch-like and not JAX-like. On JAX you never have to manually split and do an all-gather, you simply shard. You never have to explicitly do a "collective sum". You just do a sum, and if the tensors are sharded, it will magically do all the needed collectives for you (see the sketch after this list). So it's unclear to me why any of these are needed for JAX.
- I wouldn't export these symbols that you added to `keras.distributed`; I don't think they are needed. What we'll expose is the "Tensor Parallel" API.
- For better or worse, we don't do type annotations in Keras. And unfortunately, mixing code that has type annotations with code that doesn't does not work well. It's better to not have any type annotations at all.
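To make the JAX point concrete, here is a minimal sketch (not part of this PR; the shapes and mesh setup are illustrative, and it assumes the leading dimension divides evenly across the devices) showing how a plain `jnp.sum` over a sharded array triggers the required collectives without any explicit all-gather or collective-sum call:

```python
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec

# A 1-D device mesh with a single "model" axis.
mesh = Mesh(jax.devices(), axis_names=("model",))

# Place an array so its first axis is sharded across the "model" axis.
x = jnp.ones((8, 1024))
x = jax.device_put(x, NamedSharding(mesh, PartitionSpec("model", None)))

@jax.jit
def total(t):
    # No explicit collective here: XLA inserts the cross-device
    # reduction needed to sum a sharded array.
    return jnp.sum(t)

print(total(x))  # same value as the single-device result
```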
This Pull Request introduces the foundational components for a new, backend-agnostic auto-sharding system in Keras, specifically designed for tensor parallelism. It establishes the core data structures and the JAX-specific implementation of communication primitives.
Core Backend-Agnostic Abstractions
The most significant part of this PR is the creation of a generic, backend-agnostic system for defining sharding plans. This logic resides in keras/src/distribution/tensor_parallel/tensor_layout.py.
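As a rough illustration of what a sharding-plan structure can look like, here is a hypothetical sketch (not the actual contents of `tensor_layout.py`; all names, fields, and the rule format are made up for illustration):

```python
# Hypothetical sketch only -- not the API defined in this PR.
# A "sharding plan" pairing weight-name patterns with how that weight
# should be split across the model-parallel devices.
from dataclasses import dataclass, field


@dataclass
class ShardSpec:
    axis: int        # which tensor axis to split (e.g. 1 for column-parallel)
    world_size: int  # number of model-parallel shards


@dataclass
class ShardingPlan:
    # Maps a regex over weight names to how that weight is partitioned;
    # weights with no matching rule stay replicated.
    rules: dict = field(default_factory=dict)

    def add(self, pattern, spec):
        self.rules[pattern] = spec
        return self


# Example: split dense kernels column-wise across 4 devices.
plan = ShardingPlan().add(r".*dense.*kernel", ShardSpec(axis=1, world_size=4))
```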
JAX-Specific Backend Implementation
This PR provides the first backend-specific implementation of the required distributed communication primitives, targeting the JAX backend.
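For context, the kind of JAX collective such a backend layer would typically wrap looks roughly like the following. This is an illustrative sketch, not the code in this PR; the mesh axis name and shapes are assumptions.

```python
# Illustrative sketch only -- a collective-sum ("all-reduce") wrapper of
# the kind a JAX backend layer might expose, built on jax.lax.psum
# inside shard_map.
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, PartitionSpec as P
from jax.experimental.shard_map import shard_map

mesh = Mesh(jax.devices(), axis_names=("model",))


def _psum(x):
    # Runs once per device on that device's shard; psum sums across the mesh.
    return jax.lax.psum(x, axis_name="model")


# in_specs shards the input along "model"; out_specs=P() marks the
# summed result as replicated on every device.
all_reduce_sum = shard_map(_psum, mesh=mesh, in_specs=P("model"), out_specs=P())

x = jnp.arange(float(4 * jax.device_count()))
print(all_reduce_sum(x))  # elementwise sum of the per-device shards
```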
Design Document: Autosharding for Keras
Example usage: https://colab.research.google.com/drive/1UAINIcstDuO0aeA9lxCF5LaIj5ne5X5z?resourcekey=0-pPF4COO19KRoqS5cpWNILA&usp=sharing
The full code of Tensor Parallel for Keras has been divided into 4 PRs; this is the first of them.